Fligner–Killeen Test (robust test for equal variances)#

The Fligner–Killeen test checks whether multiple groups share the same variance (homoscedasticity). It is a rank-based, median-centered test designed to stay reliable when data are non-normal or contain outliers.

Goals#

  • Understand the hypotheses and interpretation (H0: equal variances).

  • Build intuition from absolute deviations and ranks.

  • Implement the test statistic from scratch with NumPy (no stats libraries).

  • Use Plotly visuals + simulations to see what the statistic is measuring.

import math
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) What question does the test answer?#

You have k independent groups (e.g., treatments, machines, cohorts) and want to know:

  • Are some groups more variable (more spread out) than others?

Formally, the hypotheses are:

\begin{aligned} H_0 &: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 \\ H_1 &: \text{at least one variance differs} \end{aligned}

The test returns a single p-value.

  • Small p-value → evidence that variances are not all equal.

  • Large p-value → not enough evidence to say variances differ (this is not proof they are equal).

Why this matters:

  • Many models/analyses (e.g., one-way ANOVA, some regression diagnostics) assume equal variances.

  • In practice, data often violate normality; Fligner–Killeen is popular because it’s robust.

2) Intuition: compare “how far from typical” each group is#

If a group has larger variance, its observations tend to sit farther away from a typical value.

Fligner–Killeen makes this idea robust:

  1. Center each group by its median (robust “typical value”).

  2. Look at absolute deviations from that median.

  3. Replace deviations by their ranks across all groups (robust to outliers).

  4. Convert ranks to half-normal scores (so the final statistic has a chi-square reference).
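Before the full example, here is the whole pipeline traced on tiny toy numbers (hypothetical data, chosen only to make each step visible). SciPy's `rankdata` and `norm.ppf` stand in here for the NumPy-only versions we build in the sections below:

```python
import numpy as np
from scipy.stats import rankdata, norm

# Toy data (hypothetical numbers, chosen only to trace the steps)
g1 = np.array([1.0, 2.0, 3.0, 10.0])  # contains an outlier
g2 = np.array([4.0, 5.0, 6.0, 7.0])

# Steps 1-2: absolute deviations from each group's median
z = np.concatenate([np.abs(g1 - np.median(g1)), np.abs(g2 - np.median(g2))])

# Step 3: ranks across ALL groups (ties share their average rank)
r = rankdata(z)

# Step 4: half-normal scores
N = len(z)
a = norm.ppf(0.5 + r / (2.0 * (N + 1.0)))

print("deviations:", z)
print("ranks     :", r)
print("scores    :", np.round(a, 3))
```

Note how the outlier in `g1` (deviation 7.5) only receives the top rank, not a rank proportional to its size; this is exactly where the robustness comes from.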

Let’s build intuition on a heavy-tailed example (Student-t), with one group having a larger scale.

# Example data: heavy-tailed (t distribution), with a location shift in group B
n = 70
df_t = 3

gA = rng.standard_t(df=df_t, size=n) * 1.0 + 0.0
gB = rng.standard_t(df=df_t, size=n) * 1.0 + 0.6  # location shift only
gC = rng.standard_t(df=df_t, size=n) * 2.0 + 0.0  # larger scale (variance)

groups = [gA, gB, gC]
group_names = np.array(["A", "B", "C"])


def mad(x: np.ndarray) -> float:
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    return float(np.median(np.abs(x - m)))


for name, g in zip(group_names, groups):
    print(
        f"{name}: n={len(g):d}, mean={g.mean(): .3f}, median={np.median(g): .3f}, "
        f"std={g.std(ddof=1): .3f}, MAD={mad(g): .3f}"
    )
A: n=70, mean=-0.107, median=-0.048, std= 1.325, MAD= 0.673
B: n=70, mean= 0.725, median= 0.574, std= 1.993, MAD= 0.867
C: n=70, mean=-0.333, median=-0.389, std= 4.030, MAD= 1.517
x = np.concatenate(groups)
labels = np.concatenate([np.full(len(g), name) for g, name in zip(groups, group_names)])

fig = px.violin(
    x=labels,
    y=x,
    color=labels,
    box=True,
    points="all",
    title="Raw data by group (B is shifted; C has larger scale)",
)
fig.update_layout(xaxis_title="Group", yaxis_title="Value", legend_title_text="Group")
fig.show()

3) Step 1: absolute deviations from each group’s center#

For group i and observation j:

[ Z_{ij} = |X_{ij} - c_i| ]

where the default Fligner–Killeen choice is:

[ c_i = \text{median}(X_{i1}, \dots, X_{in_i}). ]

If group variances are equal, the distribution of these deviations should be similar across groups.

k = len(groups)
group_ids = np.concatenate([np.full(len(g), i) for i, g in enumerate(groups)])

centers = np.array([np.median(g) for g in groups])
z = np.abs(x - centers[group_ids])

fig = px.box(
    x=labels,
    y=z,
    color=labels,
    points="all",
    title="Absolute deviations from each group median",
)
fig.update_layout(xaxis_title="Group", yaxis_title="|x - median(group)|", legend_title_text="Group")
fig.show()

4) Step 2: rank the deviations (robustness)#

Outliers can wildly inflate sample variances. To reduce sensitivity to tails, Fligner–Killeen replaces deviations by their ranks across all groups.

  • Small deviations → small ranks

  • Large deviations → large ranks

Ties (identical deviations) get average ranks.

We’ll implement rankdata ourselves with NumPy.

def rankdata_average(a: np.ndarray) -> np.ndarray:
    # Average ranks for ties (like scipy.stats.rankdata(method='average'))
    a = np.asarray(a)
    if a.ndim != 1:
        a = a.ravel()

    sorter = np.argsort(a, kind="mergesort")
    a_sorted = a[sorter]

    # Boundaries between distinct values
    distinct = np.r_[True, a_sorted[1:] != a_sorted[:-1], True]
    idx = np.flatnonzero(distinct)

    ranks_sorted = np.empty_like(a_sorted, dtype=float)
    for start, end in zip(idx[:-1], idx[1:]):
        # ranks are 1..n; tie block gets the average rank
        ranks_sorted[start:end] = 0.5 * (start + end - 1) + 1.0

    ranks = np.empty_like(ranks_sorted)
    ranks[sorter] = ranks_sorted
    return ranks


ex = np.array([10, 20, 20, 40])
print("values:", ex)
print("ranks :", rankdata_average(ex))
values: [10 20 20 40]
ranks : [1.  2.5 2.5 4. ]
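A note on why the tie handling needs care: the common argsort-of-argsort shortcut produces *ordinal* ranks, which break ties by position instead of averaging them, so it is not a drop-in replacement here:

```python
import numpy as np

# Argsort-of-argsort yields ordinal ranks: tied values get
# arbitrary distinct ranks instead of their average.
a = np.array([10.0, 20.0, 20.0, 40.0])
ordinal = np.argsort(np.argsort(a, kind="mergesort")) + 1
print("ordinal ranks:", ordinal)  # ties broken by position, not averaged
```

With the tied 20s, this gives ranks 2 and 3 where the average-rank convention assigns 2.5 to both.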

5) Step 3: convert ranks to half-normal scores#

Fligner–Killeen maps ranks to expected order statistics of (|N(0,1)|) (a half-normal distribution).

If (U \sim \mathrm{Unif}(0,1)), then:

[ |Z| \stackrel{d}{=} \Phi^{-1}\left(\tfrac{1}{2} + \tfrac{U}{2}\right) ]

where (\Phi^{-1}) is the standard normal quantile function.

Using ranks (R_{ij} \in \{1, \dots, N\}) (with (N) total samples), we set:

[ A_{ij} = \Phi^{-1}\left(\tfrac{1}{2} + \tfrac{R_{ij}}{2(N+1)}\right) ]

so all scores are nonnegative (matching deviations).
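We can sanity-check this identity by simulation, reading it in the forward direction: if (Z \sim N(0,1)), then (2\Phi(|Z|) - 1 = \mathrm{erf}(|Z|/\sqrt{2})) should be uniform on (0, 1). (SciPy's `erf` is used here only as a vectorized error function.)

```python
import numpy as np
from scipy.special import erf  # vectorized error function (SciPy assumed available)

# Forward-direction check of the identity: if Z ~ N(0,1), then
# 2*Phi(|Z|) - 1 = erf(|Z| / sqrt(2)) should be Uniform(0, 1).
check_rng = np.random.default_rng(0)
z_abs = np.abs(check_rng.standard_normal(200_000))
u = erf(z_abs / np.sqrt(2.0))

print(f"mean (expect ~0.5): {u.mean():.4f}")
print("deciles (expect ~0.1..0.9):", np.round(np.quantile(u, np.linspace(0.1, 0.9, 9)), 3))
```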

Below is a standard rational approximation for (\Phi^{-1}) (Acklam’s approximation).

def norm_ppf(p: np.ndarray) -> np.ndarray:
    # Approx inverse-CDF (ppf) of N(0,1) using Acklam's rational approximation
    p = np.asarray(p, dtype=float)
    eps = np.finfo(float).eps
    p = np.clip(p, eps, 1.0 - eps)

    # Coefficients in rational approximations
    a = np.array([
        -3.969683028665376e01,
        2.209460984245205e02,
        -2.759285104469687e02,
        1.383577518672690e02,
        -3.066479806614716e01,
        2.506628277459239e00,
    ])
    b = np.array([
        -5.447609879822406e01,
        1.615858368580409e02,
        -1.556989798598866e02,
        6.680131188771972e01,
        -1.328068155288572e01,
    ])
    c = np.array([
        -7.784894002430293e-03,
        -3.223964580411365e-01,
        -2.400758277161838e00,
        -2.549732539343734e00,
        4.374664141464968e00,
        2.938163982698783e00,
    ])
    d = np.array([
        7.784695709041462e-03,
        3.224671290700398e-01,
        2.445134137142996e00,
        3.754408661907416e00,
    ])

    plow = 0.02425
    phigh = 1.0 - plow

    x = np.empty_like(p)

    # Lower region
    mask = p < plow
    if np.any(mask):
        q = np.sqrt(-2.0 * np.log(p[mask]))
        x[mask] = (
            (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5])
            /
            ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0)
        )

    # Upper region
    mask = p > phigh
    if np.any(mask):
        q = np.sqrt(-2.0 * np.log(1.0 - p[mask]))
        x[mask] = -(
            (((((c[0] * q + c[1]) * q + c[2]) * q + c[3]) * q + c[4]) * q + c[5])
            /
            ((((d[0] * q + d[1]) * q + d[2]) * q + d[3]) * q + 1.0)
        )

    # Central region
    mask = (~(p < plow)) & (~(p > phigh))
    if np.any(mask):
        q = p[mask] - 0.5
        r = q * q
        x[mask] = (
            (((((a[0] * r + a[1]) * r + a[2]) * r + a[3]) * r + a[4]) * r + a[5]) * q
            /
            (((((b[0] * r + b[1]) * r + b[2]) * r + b[3]) * r + b[4]) * r + 1.0)
        )

    return x


ps = np.array([1e-4, 1e-2, 0.5, 0.99, 0.9999])
print("p:", ps)
print("ppf(p):", norm_ppf(ps))
p: [0.0001 0.01   0.5    0.99   0.9999]
ppf(p): [-3.719  -2.3263  0.      2.3263  3.719 ]
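If you would rather avoid hard-coded coefficients, (\Phi^{-1}) can also be inverted numerically with bisection, since (\Phi) itself is available to float precision via math.erf. A minimal sketch (slower, but no magic constants; `norm_ppf_bisect` is an illustrative helper, not part of the test implementation):

```python
import math

def norm_cdf(x: float) -> float:
    # Standard normal CDF via the error function (exact to float precision)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_ppf_bisect(p: float, lo: float = -10.0, hi: float = 10.0, iters: int = 80) -> float:
    # Invert the CDF by bisection on [lo, hi]; 80 halvings shrink
    # the bracket far below float precision.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for p in (0.5, 0.975, 0.99):
    print(p, round(norm_ppf_bisect(p), 4))
```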
ranks = rankdata_average(z)
N = len(z)

# Half-normal scores (match the classic Fligner–Killeen definition)
u = ranks / (N + 1.0)
a = norm_ppf(0.5 + 0.5 * u)

fig = go.Figure()
fig.add_trace(go.Scatter(x=z, y=a, mode="markers", marker=dict(size=6, opacity=0.5)))
fig.update_layout(
    title="Deviation size vs half-normal score (after ranking)",
    xaxis_title="Absolute deviation |x - median(group)|",
    yaxis_title="Score  Φ⁻¹(0.5 + rank/(2(N+1)))",
)
fig.show()

fig = px.box(
    x=labels,
    y=a,
    color=labels,
    points="all",
    title="Half-normal scores by group (differences reflect scale differences)",
)
fig.update_layout(xaxis_title="Group", yaxis_title="Half-normal score", legend_title_text="Group")
fig.show()

6) The Fligner–Killeen statistic#

Once we have the scores (A_{ij}), the test becomes a one-way “ANOVA on scores”:

  • Let (\bar{A}_i) be the mean score within group i.

  • Let (\bar{A}) be the overall mean score.

The statistic is

[ T = \frac{\sum_{i=1}^k n_i(\bar{A}_i - \bar{A})^2}{s_A^2} ]

where (s_A^2) is the sample variance of all scores (ddof=1).

Under (H_0), (T) is approximately

[ T \sim \chi^2_{k-1}. ]

We’ll implement the statistic with NumPy, then (later) validate against SciPy.

def fligner_killeen_statistic(*samples: np.ndarray, center: str = "median") -> tuple[float, int]:
    # Fligner–Killeen test statistic (NumPy-only implementation)
    if len(samples) < 2:
        raise ValueError("Need at least two groups.")
    if center not in {"median", "mean"}:
        raise ValueError("center must be 'median' or 'mean'.")

    cleaned = []
    for s in samples:
        s = np.asarray(s, dtype=float).ravel()
        s = s[~np.isnan(s)]
        if s.size == 0:
            raise ValueError("Groups must be non-empty after removing NaNs.")
        cleaned.append(s)

    k = len(cleaned)
    ni = np.array([len(s) for s in cleaned], dtype=int)
    if np.any(ni < 2):
        raise ValueError("Each group should contain at least 2 observations.")

    x_all = np.concatenate(cleaned)
    group_ids = np.concatenate([np.full(len(s), i, dtype=int) for i, s in enumerate(cleaned)])

    if center == "median":
        centers = np.array([np.median(s) for s in cleaned])
    else:
        centers = np.array([np.mean(s) for s in cleaned])

    z = np.abs(x_all - centers[group_ids])
    ranks = rankdata_average(z)
    N = len(z)

    # Half-normal scores
    u = ranks / (N + 1.0)
    a = norm_ppf(0.5 + 0.5 * u)

    # Group mean scores
    a_bar = float(np.mean(a))
    a_bar_i = np.array([np.mean(a[group_ids == i]) for i in range(k)])

    varsq = float(np.var(a, ddof=1))
    statistic = float(np.sum(ni * (a_bar_i - a_bar) ** 2) / varsq)
    df = k - 1
    return statistic, df


stat_fk, df_fk = fligner_killeen_statistic(gA, gB, gC, center="median")
print(f"Fligner–Killeen statistic = {stat_fk:.4f} (df={df_fk})")
Fligner–Killeen statistic = 19.8165 (df=2)
def norm_cdf_scalar(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def chi2_sf_wilson_hilferty(x: float, df: int) -> float:
    # Approximate chi-square survival function using the Wilson–Hilferty transform
    v = float(df)
    z = ((x / v) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * v))) / math.sqrt(2.0 / (9.0 * v))
    return 1.0 - norm_cdf_scalar(z)


p_approx = chi2_sf_wilson_hilferty(stat_fk, df_fk)
print(f"Approx p-value (Wilson–Hilferty) ≈ {p_approx:.6f}")
Approx p-value (Wilson–Hilferty) ≈ 0.000079
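For k = 3 groups (df = 2) there is an exact closed form to cross-check against: a chi-square with 2 degrees of freedom is Exponential with mean 2, so its survival function is (e^{-x/2}). Plugging in the statistic from above gives ≈ 4.98e-05, noticeably smaller than the Wilson–Hilferty value, a reminder that the transform is only a rough approximation this deep in the tail:

```python
import math

def chi2_sf_df2(x: float) -> float:
    # With df = 2, chi-square is Exponential(mean 2), so the
    # survival function is exactly exp(-x/2) -- no approximation.
    return math.exp(-x / 2.0)

# 19.8165 is the Fligner-Killeen statistic computed above
print(f"Exact p-value (df=2 closed form) = {chi2_sf_df2(19.8165):.6g}")
```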

7) How to interpret the result#

  • If p-value < α (often α = 0.05): reject (H_0) → evidence that not all group variances are equal.

  • If p-value ≥ α: fail to reject (H_0) → the data do not provide strong evidence of different variances.

Important nuances:

  • “Fail to reject” does not mean variances are equal; the test might have low power for small samples.

  • The test is omnibus: it doesn’t tell you which groups differ. Use visuals and (optionally) pairwise follow-ups with multiplicity control.
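One possible follow-up (a sketch on hypothetical data; in the notebook you would pass `gA`, `gB`, `gC` instead): pairwise `scipy.stats.fligner` tests with a Bonferroni correction for multiplicity.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Pairwise Fligner-Killeen tests with a Bonferroni correction.
# Hypothetical data, mimicking the setup above (group C has larger scale).
rng_pw = np.random.default_rng(0)
samples = {
    "A": rng_pw.standard_t(df=3, size=70),
    "B": rng_pw.standard_t(df=3, size=70),
    "C": rng_pw.standard_t(df=3, size=70) * 2.0,
}

pairs = list(combinations(samples, 2))
results = {}
for name1, name2 in pairs:
    _, p = stats.fligner(samples[name1], samples[name2], center="median")
    results[(name1, name2)] = min(1.0, p * len(pairs))  # Bonferroni adjustment

for (n1, n2), p_adj in results.items():
    print(f"{n1} vs {n2}: Bonferroni-adjusted p = {p_adj:.4f}")
```

Bonferroni is conservative but simple; with many groups, a less strict correction (e.g., Holm) is a common alternative.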

# A practical diagnostic: summarize group spread with a few robust + non-robust measures
spread_names = ["std", "IQR", "MAD (median abs dev)"]
spread = np.zeros((len(groups), len(spread_names)))

for i, g in enumerate(groups):
    spread[i, 0] = np.std(g, ddof=1)
    q75, q25 = np.percentile(g, [75, 25])
    spread[i, 1] = q75 - q25
    spread[i, 2] = mad(g)

fig = go.Figure()
for j, metric in enumerate(spread_names):
    fig.add_trace(go.Bar(name=metric, x=group_names, y=spread[:, j]))

fig.update_layout(
    barmode="group",
    title="Group spread summaries (diagnostics, not the test itself)",
    xaxis_title="Group",
    yaxis_title="Spread",
)
fig.show()

Location differences alone should not trigger the test#

Fligner–Killeen centers each group (by default at the median), so pure location shifts should not systematically change the deviation ranks.

Here’s a quick example: same heavy-tailed distribution and same scale, but different medians.

rng_loc = np.random.default_rng(2)

n_loc = 80
hA = rng_loc.standard_t(df=df_t, size=n_loc) * 1.0 - 1.0
hB = rng_loc.standard_t(df=df_t, size=n_loc) * 1.0 + 0.5
hC = rng_loc.standard_t(df=df_t, size=n_loc) * 1.0 + 2.0

stat_loc, df_loc = fligner_killeen_statistic(hA, hB, hC, center="median")
p_loc_approx = chi2_sf_wilson_hilferty(stat_loc, df_loc)

print(f"Location-only example: stat={stat_loc:.4f} (df={df_loc}), approx p≈{p_loc_approx:.4f}")

x_loc = np.concatenate([hA, hB, hC])
labels_loc = np.concatenate(
    [
        np.full(len(hA), "A (shift -1)"),
        np.full(len(hB), "B (shift +0.5)"),
        np.full(len(hC), "C (shift +2)"),
    ]
)

fig = px.violin(
    x=labels_loc,
    y=x_loc,
    color=labels_loc,
    box=True,
    points="all",
    title="Different locations, same scale (visual check)",
)
fig.update_layout(xaxis_title="Group", yaxis_title="Value", showlegend=False)
fig.show()
Location-only example: stat=0.8041 (df=2), approx p≈0.6745

8) What does the statistic “look like” under H0 vs H1?#

Under (H_0), the reference distribution is approximately (\chi^2_{k-1}). Under (H_1), the statistic tends to be larger because group score means separate.

We can see this by simulation.

(These simulations are for intuition; the test itself uses the chi-square approximation.)

def chi2_pdf(x: np.ndarray, df: int) -> np.ndarray:
    # Chi-square pdf using only NumPy + math.gamma (for plotting)
    x = np.asarray(x, dtype=float)
    v = float(df)
    c = 1.0 / (2.0 ** (v / 2.0) * math.gamma(v / 2.0))
    return c * np.power(x, v / 2.0 - 1.0) * np.exp(-x / 2.0)


def simulate_fk_stats(
    *,
    n_rep: int,
    n: int,
    df_t: int,
    scale_c: float,
    seed: int = 0,
) -> np.ndarray:
    sim_rng = np.random.default_rng(seed)
    out = np.empty(n_rep, dtype=float)
    for r in range(n_rep):
        a = sim_rng.standard_t(df=df_t, size=n)
        b = sim_rng.standard_t(df=df_t, size=n)
        c = sim_rng.standard_t(df=df_t, size=n) * scale_c
        out[r], _ = fligner_killeen_statistic(a, b, c)
    return out


n_rep = 1500
n_sim = 40
stats_h0 = simulate_fk_stats(n_rep=n_rep, n=n_sim, df_t=df_t, scale_c=1.0, seed=1)
stats_h1 = simulate_fk_stats(n_rep=n_rep, n=n_sim, df_t=df_t, scale_c=2.0, seed=2)

df_ref = 2  # k-1 for k=3 groups
x_grid = np.linspace(0, np.percentile(stats_h1, 99.5), 400)

fig = go.Figure()
fig.add_trace(go.Histogram(x=stats_h0, name="H0: equal variances", opacity=0.55, nbinsx=50))
fig.add_trace(go.Histogram(x=stats_h1, name="H1: group C has larger variance", opacity=0.55, nbinsx=50))
fig.add_trace(go.Scatter(x=x_grid, y=chi2_pdf(x_grid, df_ref), name=f"χ²(df={df_ref}) pdf", mode="lines"))

fig.update_layout(
    barmode="overlay",
    title="Simulated Fligner–Killeen statistic under H0 vs H1 (and χ² reference)",
    xaxis_title="Test statistic",
    yaxis_title="Density / count",
)
fig.show()

9) A simple power curve (variance ratio)#

One way to think about “power” here:

  • Fix a significance level (\alpha) (say 0.05).

  • Under different variance ratios, simulate the probability we reject (H_0).

This is not an exact analysis (it depends on distribution, n, etc.), but it builds intuition.

def chi2_ppf_wilson_hilferty(p: float, df: int) -> float:
    # Approximate chi-square quantile via Wilson–Hilferty (uses our norm_ppf)
    v = float(df)
    z = float(norm_ppf(np.array([p]))[0])
    return v * (1.0 - 2.0 / (9.0 * v) + z * math.sqrt(2.0 / (9.0 * v))) ** 3


alpha = 0.05
df_ref = 2
crit = chi2_ppf_wilson_hilferty(1.0 - alpha, df_ref)

ratios = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0])
n_rep = 600
n_sim = 40

power = []
for r, ratio in enumerate(ratios):
    stats_alt = simulate_fk_stats(
        n_rep=n_rep,
        n=n_sim,
        df_t=df_t,
        scale_c=float(ratio),
        seed=100 + r,
    )
    power.append(float(np.mean(stats_alt > crit)))

power = np.array(power)

fig = go.Figure()
fig.add_trace(go.Scatter(x=ratios, y=power, mode="lines+markers"))
fig.update_layout(
    title=f"Approx power vs variance ratio (n={n_sim}, reps={n_rep}, α={alpha})",
    xaxis_title="Scale multiplier of group C (= sqrt(variance ratio))",
    yaxis_title="Rejection rate",
    yaxis=dict(range=[0, 1]),
)
fig.show()

10) Practical usage (SciPy) + validation#

SciPy provides a production-ready implementation:

  • scipy.stats.fligner(sample1, sample2, ...)

We’ll compare its result to our NumPy implementation.

from scipy import stats

stat_scipy, p_scipy = stats.fligner(gA, gB, gC, center="median")
print(f"SciPy fligner: statistic={stat_scipy:.6f}, p-value={p_scipy:.6g}")
print(f"Our FK stat : statistic={stat_fk:.6f} (df={df_fk})")
SciPy fligner: statistic=19.816476, p-value=4.9763e-05
Our FK stat : statistic=19.816476 (df=2)

Quick comparison to other variance tests#

  • Bartlett: powerful under normality, but can be very sensitive to non-normality.

  • Levene: robust parametric alternative (uses absolute deviations; can center at mean/median).

  • Fligner–Killeen: very robust (rank-based, median-centered by default).

On heavy-tailed data, Bartlett can have inflated false positives. Let’s check this empirically with a small simulation.

def rejection_rates(*, n_rep: int, n: int, df_t: int, alpha: float, seed: int = 0) -> dict[str, float]:
    sim_rng = np.random.default_rng(seed)

    rej = {"bartlett": 0, "levene_median": 0, "fligner": 0}
    for _ in range(n_rep):
        a = sim_rng.standard_t(df=df_t, size=n)
        b = sim_rng.standard_t(df=df_t, size=n)
        c = sim_rng.standard_t(df=df_t, size=n)

        # H0 is true here: all groups have the same distribution/variance
        _, p_b = stats.bartlett(a, b, c)
        _, p_l = stats.levene(a, b, c, center="median")
        _, p_f = stats.fligner(a, b, c, center="median")

        rej["bartlett"] += int(p_b < alpha)
        rej["levene_median"] += int(p_l < alpha)
        rej["fligner"] += int(p_f < alpha)

    return {k: v / n_rep for k, v in rej.items()}


rates = rejection_rates(n_rep=600, n=35, df_t=df_t, alpha=0.05, seed=123)
rates
{'bartlett': 0.5383333333333333,
 'levene_median': 0.045,
 'fligner': 0.04833333333333333}
fig = px.bar(
    x=list(rates.keys()),
    y=list(rates.values()),
    title="Empirical type I error on heavy-tailed data (H0 true)",
)
fig.add_hline(y=0.05, line_dash="dash", line_color="black")
fig.update_layout(xaxis_title="Test", yaxis_title="Rejection rate (should be ≈ 0.05)")
fig.show()

11) Pitfalls and good practice#

  • The test assumes independent observations within/between groups.

  • It is an omnibus test: if you reject, use plots and follow-ups to locate differences.

  • If groups have very different shapes (not just scale), variance tests can reject for reasons that feel surprising. Always look at the data (violin/box/QQ plots, etc.).

  • With tiny samples, the test can have low power.

If you reject equal variances, common next steps:

  • Use methods that don’t assume equal variances (e.g., Welch’s ANOVA).

  • Consider variance-stabilizing transforms (log, sqrt) when appropriate.

  • Model heteroscedasticity directly (GLS, heteroscedastic regression, etc.).
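For two groups, the first option is a one-liner with SciPy: `ttest_ind` with `equal_var=False` performs Welch's t-test. A quick sketch on hypothetical data with very different scales:

```python
import numpy as np
from scipy import stats

# Welch's t-test (no equal-variance assumption) vs the pooled t-test.
# Hypothetical data for illustration only.
rng_w = np.random.default_rng(0)
a = rng_w.normal(loc=0.0, scale=1.0, size=50)
b = rng_w.normal(loc=0.5, scale=3.0, size=50)

t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)  # Welch
t_pool, p_pool = stats.ttest_ind(a, b, equal_var=True)     # pooled

print(f"Welch : t={t_welch:.3f}, p={p_welch:.4f}")
print(f"Pooled: t={t_pool:.3f}, p={p_pool:.4f}")
```

With unbalanced group sizes the two can disagree substantially; Welch's version keeps its nominal type I error when variances differ.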

12) Exercises#

  1. Create 4 groups with equal variance but different medians; confirm the test typically does not reject.

  2. Hold the variance ratio fixed and vary n to see how power changes.

  3. Replace the t-distribution with a skewed distribution (e.g., lognormal) and compare Bartlett/Levene/Fligner.

  4. Add a single extreme outlier to one group; compare how the sample variance vs MAD vs the tests respond.

References#

  • Fligner, M. A., & Killeen, T. J. (1976). Distribution-free two-sample tests for scale.

  • Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances.

  • SciPy documentation: scipy.stats.fligner